Speech processor.

In a speech processor such as a speech recogniser,
the problem of distortion of extracted features
caused by adaption of the input automatic gain con- 
trol (AGC) during feature extraction is solved by 
storing the AGC's gain coefficient along with the 
energy level of each extracted feature. At the end of 
the sampling period the stored gain coefficients are 
set equal to the minimum stored coefficient and the 
associated energy levels adjusted accordingly. The 
AGC circuit may comprise a digitally switched at- 
tenuator under the control of a microprocessor per- 
forming the speech recognition. 
SPEECH PROCESSOR



This invention relates to speech processors 
having automatic gain control, and in particular to 
speech recognisers. 

Automatic speech recognisers work by comparing
features extracted from audible speech signals.
Features extracted from the speech to be
recognised are compared with stored features ex- 
tracted from a known utterance. 

For accurate recognition it is important that the
features extracted from the same word or sound
when spoken at different times are sufficiently similar.
However, the large dynamic range of speech
makes this difficult to achieve, particularly in areas 
such as hands-free telephony where the sound 
level received by the microphone can vary over a 
wide range. In order to compensate for this speech 
level variation, most speech recognisers use some 
form of automatic gain control (AGC). 

The AGC circuit controls the gain to ensure
that the average signal level used by the feature 
extractor is as near constant as possible over a 
given time period. Hence quiet speech utterances 
are given greater gain than loud utterances. This 
form of AGC performs well when continuous 
speech is the input signal since after a period of 
time, the circuit gain will optimise the signal level 
to give consistent feature extraction. However, in 
the absence of speech, the gain of the AGC circuit
will increase to a level determined by the background
noise, so that at the onset of a speech
utterance the gain of the AGC circuit will be set too 
high. During the utterance the gain of the circuit is 
automatically reduced, the speed of the gain 
change being determined by the 'attack' time of
the AGC. The start of the utterance is thus sub- 
jected to a much greater gain and any features 
extracted will have a much greater energy content
than similar features extracted later, when the gain 
has been reduced. 

This distortion effect is dependent on the input 
signal level; the higher the speech level the larger 
is the distortion. Hence the first few features ex- 
tracted will not correspond to the notionally similar 
stored features, and this can often result in poor 
recognition performance. 

The present invention seeks to provide a solu- 
tion to this problem. 

According to the present invention there is 
provided a speech processor comprising an input 
to receive speech signals; signal processing means 
to extract spectral parameters from said speech 
signals; an analogue to digital converter to digitise 
said extracted parameters; an automatic gain con- 
trol means to control the signal level applied to 
said converter; characterised in that the spectral
parameters are stored at least temporarily, and for
each such stored parameter a gain coefficient indicative
of the gain applied by the gain control
means is also stored; and in that at the end of a
sampling period the gain coefficients stored in that
period are, if different, set equal to the lowest gain
coefficient stored in that period, the magnitudes of
the corresponding stored spectral parameters being
adjusted proportionally.

In a speech processor according to the invention,
configured as a speech recogniser, automatic
gain control is provided by a digitally switched
attenuator, the gain of which is determined by the
microprocessor performing the speech recognition.

The microprocessor controls the gain to ensure
that the dynamic range of the Analogue to Digital
converter (which occurs between feature extraction
and the microprocessor controlling the recogniser
even when analogue AGCs are used) is not exceeded
(except during the adaption of the AGC).
The principal difference between the known analogue
AGCs and the system according to the invention
is that in the latter the microprocessor has
control of the gain setting and can therefore store
the gain used for each feature extracted. After the
utterance has finished, the microprocessor can determine
the optimum gain setting for the complete
utterance. All the features stored are then normalised
to this optimum gain setting. By this
means a consistent set of features is extracted
independent of the input signal gain.
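
The normalisation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the use of integer gain steps, and the treatment of the stored values as amplitude-like quantities (so that a ratio in dB is 20·log10) are not taken from the source. The 1.5 dB step size is the one quoted later in the description.

```python
STEP_DB = 1.5  # attenuator step size quoted in the description


def normalise_to_min_gain(energies, gain_steps, step_db=STEP_DB):
    """Rescale each stored value as if the lowest recorded gain had been used.

    energies   -- stored values, one per extracted feature
    gain_steps -- integer AGC setting in force when each value was stored
    Assumption: values are amplitude-like, so dB = 20 * log10(ratio).
    """
    if len(energies) != len(gain_steps):
        raise ValueError("each energy needs its own gain coefficient")
    if not energies:
        return [], None
    g_min = min(gain_steps)
    normalised = [
        # A value captured (g - g_min) steps above the minimum is scaled down.
        e * 10 ** (-(g - g_min) * step_db / 20.0)
        for e, g in zip(energies, gain_steps)
    ]
    return normalised, g_min


# Hypothetical data: the first feature was captured 4 steps (6 dB) hotter.
values, g_min = normalise_to_min_gain([200.0, 100.0], [4, 0])
```

With these hypothetical inputs the first value is brought down to roughly 100, matching the second, which is the sense in which the stored features become independent of the gain in force when they were captured.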

Embodiments of the invention will be further
described and explained by reference to the
accompanying drawing, in which

Figure 1 is a schematic diagram of a speech recogniser
according to the present invention.

Throughout this patent application the invention
is described with reference to a speech recogniser
utilising template-matching, but as those skilled in
the art will be aware, the invention is equally applicable
to any of the conventional types of speech
recogniser, including those using stochastic modelling,
Markov chains, dynamic time warping and
phoneme recognition.

Speech recognition is based on comparing energy
contours from a number (generally 8 to 16) of
filter channels. While speech is present, the energy
spectrum from each filter channel is digitized with
an Analogue to Digital (A-D) converter to produce a
template which is stored in a memory.

The initial stage of recognition is known as
'training' and consists of producing the reference
templates by speaking to the recogniser the words
which are to be recognised. Once reference templates
have been made for the words to be recognised,
recognition of speech can be attempted.

When the recogniser is exposed to an utterance,
it produces a test template which can be
compared with the reference templates in the
memory to find the closest match.
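
As a rough illustration of finding the closest match, the sketch below compares a test template with each reference template. The source does not say which distance measure is used; the sum of squared differences here, the assumption that templates have equal length, and all names are hypothetical choices made only for the example.

```python
def closest_template(test, references):
    """Return the label of the reference template nearest to `test`.

    test       -- list of time slots, each a list of channel energies
    references -- dict mapping a word label to a template of the same shape
    Assumption: distance is a plain sum of squared differences.
    """
    def distance(a, b):
        return sum(
            (x - y) ** 2
            for slot_a, slot_b in zip(a, b)
            for x, y in zip(slot_a, slot_b)
        )

    return min(references, key=lambda label: distance(test, references[label]))


# Hypothetical two-slot, two-channel templates.
refs = {"yes": [[1, 2], [3, 4]], "no": [[9, 9], [9, 9]]}
best = closest_template([[1, 2], [3, 5]], refs)
```

A practical recogniser would also have to cope with utterances of different lengths, for example by time alignment, which this sketch leaves out.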

The fundamental elements of the speech recogniser
according to the present invention are
shown in Figure 1. Voice signals received by the
microphone 1 and amplified by amplifier 2 are
passed to a filter bank 3a. In the filter bank the
voice signals are filtered into a plurality (in this
case 16) of frequency bands, and the signals are
rectified by rectifier 4. The filtered and rectified
signals are smoothed by low pass filters 3b and
then sequentially sampled by a multiplexer 5 which
feeds the resultant single channel signal to the
digital AGC (DAGC) circuit 8 which in turn feeds an Analogue to
Digital converter 6 from which the digitized signal
stream is passed to the controlling microprocessor
7.

The multiplexer addresses each filter channel
for 20 microseconds before addressing the next
one. At the end of each 10 millisecond time slot,
each channel's sampled energy for that period is
stored. The templates, which are produced during
training or recognition, consist of up to 100 time slot
samples for each filter channel.
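
The figures quoted above imply the following timing and storage budget. This is a worked calculation rather than part of the described apparatus; the variable names are illustrative, and the number of passes per slot assumes the multiplexer cycles continuously with no repeated readings, which the source does not state.

```python
CHANNELS = 16     # filter bands in the described embodiment
DWELL_US = 20     # multiplexer dwell time per channel, microseconds
SLOT_MS = 10      # time slot length, milliseconds
MAX_SLOTS = 100   # upper limit on template length, in time slots

scan_us = CHANNELS * DWELL_US                   # one pass over all channels
scans_per_slot = (SLOT_MS * 1000) // scan_us    # whole passes fitting in a slot
template_values = MAX_SLOTS * CHANNELS          # stored energies per template
template_seconds = MAX_SLOTS * SLOT_MS / 1000   # longest utterance covered

print(scan_us, scans_per_slot, template_values, template_seconds)
```

One pass over the 16 channels takes 320 microseconds, so a 10 millisecond slot leaves considerable headroom, and a full template holds 1600 values covering one second of speech.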

The digital AGC operates in the following way.
Each time the multiplexer addresses a filter channel,
the microprocessor assesses the channel's energy
level to determine whether the A-D converter
has been overloaded and hence that the gain is too
high. When the microprocessor determines that the
gain is too high it decrements the AGC's gain by 1
step, which corresponds to a reduction in gain of
1.5dB, and looks again at the channel's energy
level. The multiplexer does not cycle to the next
channel until the microprocessor has determined
that the gain has been reduced sufficiently to prevent
overloading of the A-D converter. When the
multiplexer does cycle to the next filter channel,
the gain of the AGC circuit is held at the new low
level unless that level results in the overloading of
the A-D converter with the new channel's energy
level, in which case the gain is decremented
as previously described. When the multiplexer has
addressed the final filter channel, the microprocessor
normalises the energy levels of all the channels
by setting their gain coefficients (which have been
stored together with the energy level information in
memory associated with the microprocessor) to the
new minimum established by the microprocessor.




In this way a consistent set of features is extracted
independent of the initial output signal gain
and any changes in the gain during formation of
the template.
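
The per-channel gain reduction described above can be sketched as a simple loop. Several details here are assumptions for illustration only: the 8-bit converter limit, the lower bound on the gain setting, the mapping from gain step to multiplier (amplitude convention, 20·log10), and the idea of passing in un-attenuated channel levels as a list.

```python
ADC_MAX = 255   # assumed 8-bit converter; the source gives no word length
STEP_DB = 1.5   # gain change per step, as quoted in the description


def scan_channels(levels, gain_step, adc_max=ADC_MAX, step_db=STEP_DB, min_step=0):
    """One pass of the multiplexer over all filter channels.

    levels    -- hypothetical un-attenuated channel levels (amplitude-like)
    gain_step -- AGC setting carried over from the previous channel or slot
    Returns (readings, steps): the converter reading and the gain step
    recorded for each channel.
    """
    readings, steps = [], []
    for level in levels:
        while True:
            reading = level * 10 ** (gain_step * step_db / 20.0)
            if reading <= adc_max or gain_step <= min_step:
                break
            gain_step -= 1  # converter overloaded: drop the gain by one step
        readings.append(min(reading, adc_max))  # clip if no gain range is left
        steps.append(gain_step)                 # gain is held for the next channel
    return readings, steps


# Hypothetical levels: the second channel overloads at step 4 but fits at step 3.
readings, steps = scan_channels([100, 150, 50], gain_step=4)
print(steps)  # [4, 3, 3]
```

The gain only ever moves downwards within a pass, so the last recorded step is also the minimum, and it is to this setting that the stored values are subsequently normalised.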

The speech recogniser is also required to de- 
tect the beginning and end of the speech or word 
with a high degree of accuracy. The speech recog- 
niser according to the present invention uses the 
following technique: 

A. The energy level of the background noise
is measured and stored for 32 time slots (at 10
milliseconds a sample) while simultaneously adjusting
(reducing) the gain of the AGC circuit as
described above to cope with the maximum noise
energy.

B. The maximum energy sample is found by
adding all the filter values for each time slot, dividing
by 16 (the number of filter channels) and multiplying
by a gain factor corresponding to the gain
of the DAGC circuit, and then comparing each time
slot to find the maximum.

C. The threshold which needs to be ex- 
ceeded before speech is deemed to be present is 
set to be equal to 1.5 times the maximum noise 
energy determined in Step B. 

D. The average noise energy for each filter 
channel is found and stored (for each channel it is 
the sum of energies over all 32 time slots, divided 
by 32) to establish a noise template. 

E. Thereafter, the filter bank is scanned ev- 
ery 10 milliseconds and the data is stored in a 
temporary cyclic store, of 100 time samples, until 
the average filter energy exceeds the noise/speech 
threshold calculated in C. 

F. If the noise/speech threshold is not ex- 
ceeded after 32 samples, a check is performed to 
ensure that the gain of the DAGC circuit is not set 
too low. This is done by looking at the maximum 
filter channel value stored in those 32 time slots. If 
that maximum level is 1.5dB or more below the 
maximum acceptable input level for the A-D con- 
verter, the gain of the AGC is incremented by 1 to 
increase the gain by 1.5dB. If the threshold is not 
exceeded after 32 samples and the DAGC setting 
is correct, then the noise/speech threshold is recal- 
culated by finding the maximum energy over the 
last 32 samples (as in B) and multiplying by 1.5 (as
in C). 

G. Once the noise/speech threshold has 
been exceeded the filter bank is scanned every 10 
milliseconds and the filter data is stored in mem- 
ory, to form the speech templates, until either 100 
samples have been entered or until the energy 
level drops below the noise/speech threshold for 20 
consecutive samples. As described above, if during 
the data input the A-D converter is overloaded, the 
AGC setting is decremented by 1 and the data for 
that filter channel is reprocessed. If during the scan 
of the 16 filter channels the gain of the DAGC
circuit is reduced, the data from all 16 channels is
re-input so that all the filter data corresponds to the
same AGC setting. The AGC value used is recorded
in memory along with the filter data. The
AGC setting used at the start of each time slot is
taken from the previous time frame, hence the gain
can only be reduced (not increased) during the
speech processing phase. This is not a problem
since at the end of the template period all the
template data is normalised to a uniform AGC
setting.

H. To ensure that the start of speech was
not missed by the speech/noise detector threshold,
the 15 time samples prior to speech detection are
transferred from the temporary cyclic store to the
front of the 'speech' template.

I. If more than 100 samples were processed
prior to speech being detected, the noise template
is recalculated by analysing (as in D) the oldest 32
time frames in the temporary cyclic store. If fewer
than 100 samples were processed prior to speech
being detected, the noise template established in
step D is used in the following steps.

J. The minimum gain setting of the AGC
over the speech template is then found and both
the speech and noise templates are normalised to
this setting, which results in both templates containing
the values that would have been entered
had that gain been used from the start.

K. The normalised noise template is then 
subtracted from every time frame of the normalised 
speech template. 

L. The maximum energy in the normalised
speech template is now found and a new
noise/speech threshold calculated, equal to the
maximum energy minus 18dB. This new threshold
is used to scan the normalised speech template to
determine the start and finish points of the speech.

M. The speech template is then truncated to
the start and finish points and is either stored in
memory (training) or is used for recognition.

The following tabular example represents the values
stored after measuring the background noise for
320 milliseconds (32 time slots of 10 milliseconds
each).

[...] the signal going into the A/D, hence to calculate the "real"
energy all the filter bank values above would have to be
doubled.

Maximum real energy (averaged over all filters) was: 410
Threshold to be exceeded to start/end template recording: 615
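
The threshold arithmetic of steps B and C, and the end-point threshold of step L, can be sketched as below. The function names and the sample data are hypothetical: the slot values are chosen only so that the result reproduces the quoted figures of 410 and 615 with a gain factor of 2. Treating the 18 dB drop as an amplitude ratio (20·log10) is also an assumption, since the source does not say which convention applies.

```python
def slot_energy(filter_values, gain_factor):
    """Step B: mean over the filter channels, scaled to 'real' energy."""
    return sum(filter_values) / len(filter_values) * gain_factor


def speech_threshold(noise_slots, gain_factor, margin=1.5):
    """Steps B and C: margin times the largest per-slot noise energy."""
    max_energy = max(slot_energy(s, gain_factor) for s in noise_slots)
    return max_energy, margin * max_energy


def find_endpoints(slot_energies, drop_db=18.0):
    """Step L: first and last slots at or above (maximum minus drop_db).

    Assumption: amplitude-like values, so 18 dB is a ratio of 10**(18/20).
    """
    threshold = max(slot_energies) * 10 ** (-drop_db / 20.0)
    above = [i for i, e in enumerate(slot_energies) if e >= threshold]
    return above[0], above[-1], threshold


# Hypothetical noise slots of 16 channels each; gain factor 2 means the
# stored values must be doubled to recover the real energy.
noise = [[205] * 16, [100] * 16]
max_energy, threshold = speech_threshold(noise, gain_factor=2)
print(max_energy, threshold)  # 410.0 615.0

start, finish, _ = find_endpoints([1, 2, 50, 100, 60, 3, 1])
print(start, finish)  # 2 4
```

The factor of 1.5 applied to the maximum noise energy gives the speech onset threshold, while the later 18 dB rule is relative to the loudest slot of the normalised speech template and is used only to trim its start and finish.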

Because the invention's primary application is
to voice recognition it has been described with 
reference to that application. However, as those 
skilled in the art will be aware, the invention is not 
only applicable to voice recognition, but is ap- 
plicable to practically any situation where voice 
signals are processed for feature extraction. 

The speech processor according to the present 
invention is particularly suitable for use in applica- 
tions where background noise and variations in the 
level of that background noise are a problem for 
known speech processors. One such application is 
in hands-free telephony, and in particular hands- 
free telephony involving cellular radio terminals. 
Such terminals are frequently used in cars, where it 
is convenient to use speech recognition to provide 
hands-free call connection and dialling. The problem
arises, however, that wind, road and engine
noise fluctuate widely and make accurate recognition
of speech difficult. Clearly, if speech recognition
for hands-free telephony is to be fully acceptable
in this application it is necessary that the
recogniser accepts and acts correctly in response 
to voiced commands in the presence of back- 
ground noise, without routinely requiring that the 
commands be repeated. 

The improved accuracy of recognition provided 
by the present invention is of particular advantage 
in this application. 

Claims 

1. A speech processor comprising an input to
receive speech signals; signal processing means to 
extract spectral parameters from said speech sig- 
nals; an analogue to digital converter to digitise 
said extracted parameters; an automatic gain con- 
trol means to control the signal level applied to 
said converter; characterised in that the spectral 
parameters are stored at least temporarily, and for 
each such stored parameter a gain coefficient indi- 
cative of the gain applied by the gain control 
means is also stored; and in that at the end of a 
sampling period the gain coefficients stored in that 
period are, if different, set equal to the lowest gain 
coefficient stored in that period, the magnitudes of 
the corresponding stored spectral parameters be- 
ing adjusted proportionally. 

2. A speech processor as claimed in claim 1 in
which each extracted spectral parameter corresponds
to the energy content of a particular frequency
band in a time slot of length t, further
characterised in that for each extracted parameter
the signal level applied to the analogue to digital
converter is determined in a small fraction of time
t, and if the signal level is greater than a predetermined
level the gain is reduced and the signal level
re-assessed, the signal strength assessment and
the gain reduction being repeated within time slot t
until the signal level is at a finalised level not
exceeding said predetermined level.

3. A speech processor as claimed in claim 2 
wherein said predetermined level is equal to the 
maximum level which does not exceed the dy- 
namic range of the analogue to digital converter. 

4. A speech processor as claimed in claim 2 or
claim 3 wherein in a single time slot of length t
spectral parameters are established for a plurality
of discrete frequency bands, further characterised
in that the different frequency bands are addressed
sequentially, with the finalised gain coefficient of
any frequency band being used as the initial gain
coefficient of the next addressed frequency band.

5. A speech processor as claimed in any one
of claims 2 to 4 wherein the sampling period is
made up of a plurality of time slots of length t.

6. A speech processor as claimed in any one
of the preceding claims, configured as a speech
recogniser.

7. A speech processor as claimed in any one
of the preceding claims, wherein the gain control
means comprises a digitally switched attenuator
under the control of a microprocessor one of
whose inputs is connected to the digitised output of
the analogue to digital converter, the gain of the
attenuator being determined by the microprocessor.

8. A cellular radio terminal comprising a
speech recogniser for selecting functions in response
to voiced instructions, characterised in that
the speech recogniser comprises a speech processor
as claimed in any one of claims 1 to 5.